Lab 5: Fitting Distributions
Find soulmates of the data
Introduction
In this tutorial you will learn how to determine the distribution of your data. By the end of the tutorial you will be able to:
- calculate the four moments
- compare the distributions with the ideal ones
- draw quantile plots and qqplots
Getting Started
We will visualize two datasets: financial series and earthquakes. To follow along you need:
- to download earthquakes.csv and findata.xlsx data uploaded on LEARN
- some packages, including tidyverse, moments and qqtest
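If these packages are not installed yet, a one-time setup (assuming CRAN access) looks like:

```r
# Run once if the packages are missing:
# install.packages(c('tidyverse', 'moments', 'qqtest'))
library(tidyverse)
library(moments)
library(qqtest)
```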
Reading and Preprocessing Data
Let’s read the findata and earthquakes datasets as below:
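The code that produced the output below is not shown; a minimal sketch for the financial data (assuming the file was saved in a data/ folder and using readxl for the Excel file — the path is an assumption, adjust it to where you saved the LEARN download) is:

```r
library(readxl)  # reads .xlsx files

# Assumed location of the LEARN download
findata <- read_excel('data/findata.xlsx')
head(findata)
```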
## # A tibble: 6 x 7
## Date MSFT AAPL DJI `S&P` Coca Pepsi
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-21 183. 317. 24474. 2949. 45.2 130.
## 2 2020-05-22 184. 319. 24465. 2955. 45.0 130.
## 3 2020-05-26 182. 317. 24995. 2992. 46.1 130.
## 4 2020-05-27 182. 318. 25548. 3036. 46.7 131.
## 5 2020-05-28 181. 318. 25401. 3030. 47.1 132.
## 6 2020-05-29 183. 318. 25383. 3044. 46.7 132.
The data contains information on four companies and two financial indices. We are interested in their daily returns, so we will define a calc_return function which calculates the return from today's price and yesterday's price, P_{n}/P_{n-1} - 1:
calc_return <- function(ser){
ret <- ser[-1]/head(ser,-1) - 1
return(ret)
}
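As a quick sanity check, here is the function applied to a toy price series (the definition is repeated so the snippet runs on its own):

```r
# Same calc_return as above: (P_n / P_{n-1}) - 1
calc_return <- function(ser) ser[-1] / head(ser, -1) - 1

# Prices 100 -> 110 -> 99: a +10% day followed by a -10% day
calc_return(c(100, 110, 99))  # 0.1 -0.1
```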
returns <- apply(findata[,-1],2, calc_return)
head(returns)
## MSFT AAPL DJI S&P Coca
## [1,] 0.03571208 0.055556561 0.022255692 0.0144088553 0.044417687
## [2,] 0.01725028 -0.004784381 -0.008857952 -0.0079476008 -0.020689655
## [3,] -0.02543175 0.033653825 0.007321911 0.0047300507 0.004694986
## [4,] -0.01739026 -0.013954671 -0.001072728 -0.0007633938 0.007009345
## [5,] -0.02654705 0.066037776 0.009111015 0.0039897580 -0.002320482
## [6,] -0.02727101 -0.022122490 -0.019775601 -0.0135283550 -0.038371875
## Pepsi
## [1,] 0.039634249
## [2,] -0.019061631
## [3,] 0.005979122
## [4,] -0.002971792
## [5,] -0.007451517
## [6,] -0.030030054
The problem with financial series is that there are outliers that reduce the quality of the visuals. For example, market crashes are very unlikely, but when they happen the daily return can be around -35%. We can remove the outliers using another function:
remove_outliers <- function(ser){
ser[ ser < quantile(ser,0.99) & ser > quantile(ser,0.01) ]
}
returns <- apply(returns, 2, remove_outliers)
returns <- data.frame(returns)
dim(returns)
## [1] 8449 6
## MSFT AAPL DJI S.P Coca Pepsi
## 1 0.03571208 0.055556561 0.022255692 0.0144088553 -0.020689655 0.039634249
## 2 0.01725028 -0.004784381 -0.008857952 -0.0079476008 0.004694986 -0.019061631
## 3 -0.02543175 0.033653825 0.007321911 0.0047300507 0.007009345 0.005979122
## 4 -0.01739026 -0.013954671 -0.001072728 -0.0007633938 -0.002320482 -0.002971792
## 5 -0.02654705 0.066037776 0.009111015 0.0039897580 0.014510277 -0.007451517
## 6 -0.02727101 -0.022122490 -0.019775601 -0.0135283550 -0.008343189 -0.030030054
The second dataset we will use is the earthquakes data:
## 'data.frame': 23412 obs. of 21 variables:
## $ Date : Factor w/ 12401 levels "01/01/1967","01/01/1969",..: 35 102 134 236 268 301 368 465 501 536 ...
## $ Time : Factor w/ 20472 levels "00:00:03","00:00:04",..: 11785 9866 15448 16088 11612 11666 11604 19888 9901 9195 ...
## $ Latitude : num 19.25 1.86 -20.58 -59.08 11.94 ...
## $ Longitude : num 145.6 127.4 -174 -23.6 126.4 ...
## $ Type : Factor w/ 4 levels "Earthquake","Explosion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Depth : num 132 80 20 15 15 ...
## $ Depth.Error : num NA NA NA NA NA NA NA NA NA NA ...
## $ Depth.Seismic.Stations : int NA NA NA NA NA NA NA NA NA NA ...
## $ Magnitude : num 6 5.8 6.2 5.8 5.8 6.7 5.9 6 6 5.8 ...
## $ Magnitude.Type : Factor w/ 11 levels "","MB","MD","MH",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ Magnitude.Error : num NA NA NA NA NA NA NA NA NA NA ...
## $ Magnitude.Seismic.Stations: int NA NA NA NA NA NA NA NA NA NA ...
## $ Azimuthal.Gap : num NA NA NA NA NA NA NA NA NA NA ...
## $ Horizontal.Distance : num NA NA NA NA NA NA NA NA NA NA ...
## $ Horizontal.Error : num NA NA NA NA NA NA NA NA NA NA ...
## $ Root.Mean.Square : num NA NA NA NA NA NA NA NA NA NA ...
## $ ID : Factor w/ 23412 levels "AK11232962","AK11248623",..: 2580 2581 2582 2583 2584 2585 2586 2587 2711 2588 ...
## $ Source : Factor w/ 13 levels "AK","ATLAS","CI",..: 5 5 5 5 5 5 5 5 6 5 ...
## $ Location.Source : Factor w/ 48 levels "AEI","AEIC","AG",..: 21 21 21 21 21 21 21 21 21 21 ...
## $ Magnitude.Source : Factor w/ 24 levels "1000","1009",..: 12 12 12 12 12 12 12 12 12 12 ...
## $ Status : Factor w/ 2 levels "Automatic","Reviewed": 1 1 1 1 1 1 1 1 1 1 ...
The dates in this data are read as factors, but we need them as dates. We will also need a Datetime column to calculate the time difference between two earthquakes, and a YearMonth column to count the number of events in each month:
library('lubridate')
eqs <- read.csv('data/earthquake.csv')
eqs$Date <- as.Date(eqs$Date, format = '%m/%d/%Y')
eqs <- mutate(eqs,
Datetime = as_datetime(paste0(Date, ' ', Time)),
YearMonth = paste0(year(Date), '-', month(Date)))
dim(eqs)
## [1] 23412 23
## Date Time Latitude Longitude Type Depth Depth.Error
## 1 1965-01-02 13:44:18 19.246 145.616 Earthquake 131.6 NA
## 2 1965-01-04 11:29:49 1.863 127.352 Earthquake 80.0 NA
## 3 1965-01-05 18:05:58 -20.579 -173.972 Earthquake 20.0 NA
## 4 1965-01-08 18:49:43 -59.076 -23.557 Earthquake 15.0 NA
## 5 1965-01-09 13:32:50 11.938 126.427 Earthquake 15.0 NA
## 6 1965-01-10 13:36:32 -13.405 166.629 Earthquake 35.0 NA
## Depth.Seismic.Stations Magnitude Magnitude.Type Magnitude.Error
## 1 NA 6.0 MW NA
## 2 NA 5.8 MW NA
## 3 NA 6.2 MW NA
## 4 NA 5.8 MW NA
## 5 NA 5.8 MW NA
## 6 NA 6.7 MW NA
## Magnitude.Seismic.Stations Azimuthal.Gap Horizontal.Distance Horizontal.Error
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## Root.Mean.Square ID Source Location.Source Magnitude.Source
## 1 NA ISCGEM860706 ISCGEM ISCGEM ISCGEM
## 2 NA ISCGEM860737 ISCGEM ISCGEM ISCGEM
## 3 NA ISCGEM860762 ISCGEM ISCGEM ISCGEM
## 4 NA ISCGEM860856 ISCGEM ISCGEM ISCGEM
## 5 NA ISCGEM860890 ISCGEM ISCGEM ISCGEM
## 6 NA ISCGEM860922 ISCGEM ISCGEM ISCGEM
## Status Datetime YearMonth
## 1 Automatic 1965-01-02 13:44:18 1965-1
## 2 Automatic 1965-01-04 11:29:49 1965-1
## 3 Automatic 1965-01-05 18:05:58 1965-1
## 4 Automatic 1965-01-08 18:49:43 1965-1
## 5 Automatic 1965-01-09 13:32:50 1965-1
## 6 Automatic 1965-01-10 13:36:32 1965-1
world <- map_data("world")
ggplot(data = world, aes(x=long, y=lat, group=group)) +
geom_polygon(fill = "white", colour = "black") +
geom_point(data=subset(eqs,Date>'2010-01-01'), aes(x=Longitude, y=Latitude, group=1, size = Magnitude), colour='firebrick', alpha=.1)+
coord_quickmap() +
theme_void()
Continuous Distributions
There are several continuous distributions, each suited to different kinds of random variables:
Normal
\[X \sim N(\mu,\sigma^2) \Rightarrow f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac12(\frac{x-\mu}{\sigma})^2}\]
Exponential
\[X \sim exp(\lambda) \Rightarrow f(x) = \lambda e^{-\lambda x}\]
Uniform
Log-Normal
Beta
Gamma
Weibull
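Each of these distributions comes with four base-R functions, prefixed d (density), p (CDF), q (quantile) and r (random draws). A few examples:

```r
dnorm(0)                 # standard normal density at 0: 1/sqrt(2*pi), about 0.3989
pexp(1, rate = 1)        # P(X <= 1) for Exp(1): 1 - exp(-1), about 0.6321
qunif(0.5)               # median of Uniform(0, 1): 0.5
set.seed(1)
rweibull(3, shape = 1.5) # three random draws from Weibull(shape = 1.5, scale = 1)
```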
Quantile Plots - Chef’s Knife
normal <- data.frame(X = rnorm(200))
logNormal <- exp(normal) # A log-normal RV is a variable which becomes normal when you transform it with log
g1 <- ggplot(normal, aes(y=X)) +
geom_histogram(aes(x = ..density..), colour = 'white', bins=20) +
geom_boxplot(aes(x = -.1), width=.1)
g2 <- ggplot(logNormal, aes(y=X)) +
geom_histogram(aes(x = ..density..), colour = 'white', bins=20) +
geom_boxplot(aes(x = -.25), width=.25)
gridExtra::grid.arrange(g1,g2,ncol=2)
normal <- data.frame(X = rnorm(100, 10, 1))
ggplot(normal, aes(y=X)) +
geom_histogram(aes(x = -..density..), alpha = .7, colour = 'white', bins=20) +
geom_point(aes(y=sort(X), x=ppoints(X)), colour='firebrick', alpha = .3, size =3 ) +
geom_boxplot(aes(x = 1.25), width=.25)
logNormal <- exp(normal)
ggplot(logNormal, aes(y=X)) +
geom_histogram(aes(x = -..density..), alpha = .7, colour = 'white', bins=20) +
geom_point(aes(y=sort(X), x=seq(0,1, length.out = length(X))), colour='steelblue', alpha = .3, size =3 ) +
geom_boxplot(aes(x = 1.25), width=.25)
set.seed(157)
betaDist <- data.frame(X = rbeta(100, 6, 4))
ggplot(betaDist, aes(y=X)) +
geom_point(aes(y=sort(X), x=seq(0,1, length.out = length(X))), colour='orange', alpha = .3, size =3 ) +
geom_boxplot(aes(x = 1.25), width=.25)
ggplot(iris, aes(y=Petal.Length)) +
geom_point(aes(y=sort(Petal.Length), x=seq(0,1, length.out = length(Petal.Length))), colour='darkolivegreen3',
alpha = .3, size =3 ) +
geom_histogram(aes(x = -..density..), alpha = .7, colour = 'white', bins=15) +
geom_boxplot(aes(x = 1.25), width=.25)
ind <- seq(0,1,length.out = 1000)
binom <- data.frame(ind=ind, X = sort(rbinom(1000, 50,.95)))
geom <- data.frame(ind=ind, X = sort(rgeom(1000, .4)))
pois <- data.frame(ind=ind, X = sort(rpois(1000, 3.22)))
normal <- data.frame(ind=ind, X = sort(rnorm(1000)))
logNormal <- data.frame(ind=ind, X = sort(exp(normal$X)))
betaDist <- data.frame(ind=ind, X = sort(rbeta(1000, 3, 3)))
dists <- data.frame(rbind(normal,logNormal,betaDist, binom,geom,pois))
dists$Distribution <- unlist(lapply(c('N(0,1)','LogNormal(0,1)','Beta(3,3)','Binom(50,0.95)','Geom(0.4)','Poisson(3.22)'), function(n) rep(n,1000)))
dists$Distribution <- factor(dists$Distribution, levels = c('N(0,1)','LogNormal(0,1)','Beta(3,3)','Binom(50,0.95)','Geom(0.4)','Poisson(3.22)'))
ggplot(dists,aes(y=X, x=ind, colour=Distribution)) +
geom_point(alpha = .3, size = 3) +
facet_wrap( ~ Distribution, ncol=3, scale='free') +
theme(legend.position = 'none') +
ggtitle('Quantile Plots of Different Distributions')
Identifying Distributions
Moments - Measures for Distribution Characteristics
There are two moments that are important in life. The moment you are born and the moment you find out why… Nevermind.
There are four moments that are important in summarizing characteristics of distributions. These are first, second, third and fourth moments which correspond to:
- Mean: The average
- Variance: Square of standard deviation
- Skewness: The asymmetry
- Kurtosis: Thickness of tails
You will learn moments, moment generating functions, etc. in detail later. Base R doesn't compute all of them for you, so we will use the moments package.
The moments take specific values for each distribution. For example, the skewness and kurtosis of the Normal distribution are 0 and 3 respectively. For other distributions you will sometimes see the excess kurtosis, which is the kurtosis benchmarked against that of the Normal distribution:
- the skewness and excess kurtosis of the Exponential are 2 and 6 respectively.
- the skewness and excess kurtosis of the Uniform are 0 and -6/5 (no kidding) respectively.
- the skewness and excess kurtosis of the Poisson are \(1/\sqrt \lambda\) and \(\lambda^{-1}\) respectively.
You can check these on Wikipedia by searching for the distribution; the table on the right-hand side of the page summarizes these properties.
We will now generate random variables and calculate their moments. We also define a fourmoments function to summarize the four moments:
library('moments')
set.seed(156)
dists <- data.frame('Normal' = rnorm(1000),
'Exponential' = rexp(1000,3),
'Uniform' = runif(1000),
'Beta' = rbeta(1000, 6, 4))
fourmoments <- function(rv){
c('Mean' = mean(rv),
'Variance' = var(rv),
'Skewness' = skewness(rv),
'Kurtosis' = kurtosis(rv))
}
fourmoments(dists$Normal)
## Mean Variance Skewness Kurtosis
## -0.01634422 1.00645313 -0.04879710 2.95845771
We will use apply to apply the function to all columns:
## Normal Exponential Uniform Beta
## Mean -0.01634422 0.3266587 0.50267007 0.60474814
## Variance 1.00645313 0.1046882 0.08356373 0.02025603
## Skewness -0.04879710 2.1312615 0.02188583 -0.26241843
## Kurtosis 2.95845771 10.1662499 1.79866096 2.78964164
The four moments are close to the ideal ones, but there are small deviations. For example, the kurtosis of the Normal sample is 2.95 as opposed to 3, while the Uniform sample gives 1.8, exactly the ideal 3 - 6/5 = 1.8. The deviation is much larger for the Exponential: the ideal kurtosis is 9 (excess kurtosis 6), whereas we see 10. This is not a big difference, because a small number of unlikely events can make your data leptokurtic (thick-tailed).
If we generate larger samples, the estimates get closer to the ideal moments:
set.seed(157)
dists <- data.frame('Normal' = rnorm(10000),
'Exponential' = rexp(10000,3),
'Uniform' = runif(10000),
'Beta' = rbeta(10000, 6, 4))
apply(dists,2,fourmoments)
## Normal Exponential Uniform Beta
## Mean -0.009463987 0.3343949 0.50011903 0.59707382
## Variance 1.005879410 0.1141077 0.08367487 0.02166831
## Skewness -0.004849413 2.0193060 -0.01112597 -0.17851546
## Kurtosis 2.933946209 8.9445674 1.80793466 2.55468255
Quantile Comparisons
g1 <- ggplot(returns) +
geom_histogram(aes(y = AAPL, x=..count..), bins = 20, colour = 'white', fill = 'firebrick') +
geom_histogram(aes(y = MSFT, x=-..count..), bins = 20, colour = 'white', fill = 'steelblue') +
labs(y = "Returns") +
annotate('text',x = -1000, y = 0.05, label='MSFT') +
annotate('text',x = 1000, y = 0.05, label='AAPL')
g2 <- ggplot(returns) +
geom_point(aes(y = sort(AAPL), x=ppoints(AAPL)), colour = 'firebrick', alpha=.2) +
geom_point(aes(y = sort(MSFT), x=ppoints(MSFT)), colour = 'steelblue', alpha=.2) +
labs(x = 'Probability Points', y= 'Sample Quantiles')
grid.arrange(g1,g2,ncol=2)
On the left we can see the distributions back-to-back. Microsoft returns are more concentrated around the center, whereas Apple returns have thicker tails. The same can be read from the quantile plots: between returns -0.02 and 0 the Microsoft quantiles are flatter than Apple's, while between -0.04 and -0.02 Apple is flatter. The same holds for values greater than 0. Apple is riskier than Microsoft.
ggplot(returns) +
geom_point(aes(y = sort(AAPL), x=ppoints(AAPL)), colour = 'firebrick', alpha=.2) +
geom_point(aes(y = sort(MSFT), x=ppoints(MSFT)), colour = 'steelblue', alpha=.2) +
geom_point(aes(y = sort(Coca), x=ppoints(MSFT)), colour = 'orange', alpha=.2) +
geom_point(aes(y = sort(Pepsi), x=ppoints(MSFT)), colour = 'purple', alpha=.2) +
labs(x = 'Probability Points', y= 'Sample Quantiles') Pepsi and Coca cola group are less risky than the tech companies. Their distributions are quite closer to each other. What do the moments say?
## MSFT AAPL DJI S.P Coca Pepsi
## Mean 0.0011 0.0011 0.0004 0.0004 0.0005 0.0005
## Variance 0.0003 0.0005 0.0001 0.0001 0.0002 0.0002
## Skewness 0.1583 0.1245 -0.1670 -0.1900 0.1177 0.0708
## Kurtosis 3.7222 3.6565 3.9586 4.0470 3.6895 3.7874
On average, the tech companies have higher returns. Their variance (risk squared) is also higher, with Apple's risk being the highest. All four companies are right skewed, meaning the right tail is longer: most observations sit slightly to the left of the mean, while a few large positive returns stretch the tail to the right. The two indices, by contrast, are slightly left skewed. All the series are leptokurtic, i.e. the tails are thick.
The sample size is large. In practice we don't always have this much data. Let's check the last 300 days:
g1 <- ggplot(tail(returns,300)) +
geom_histogram(aes(y = AAPL, x=..count..), bins = 20, colour = 'white', fill = 'firebrick')+
geom_histogram(aes(y = MSFT, x=-..count..), bins = 20, colour = 'white', fill = 'steelblue') +
labs(y = "Returns") +
annotate('text',x = -50, y = 0.05, label='MSFT') +
annotate('text',x = 50, y = 0.05, label='AAPL')
g2 <- ggplot(tail(returns,300)) +
geom_point(aes(y = sort(AAPL), x=ppoints(AAPL)), colour = 'firebrick', alpha=.2) +
geom_point(aes(y = sort(MSFT), x=ppoints(MSFT)), colour = 'steelblue', alpha=.2)
grid.arrange(g1,g2,ncol=2)
Apparently Apple continues to be risky, in both the positive and negative directions.
Let's focus on Microsoft. Is it Normal? To check, we will compute Normal quantiles with the same mean and variance:
series <- tail(returns$MSFT, 300)
avgRet <- mean(series)
sdRet <- sd(series)
probs <- ppoints(series)
q.normal <- qnorm(probs, avgRet, sdRet)
ggplot(tail(returns,300)) +
geom_point(aes(y = sort(MSFT), x=ppoints(MSFT)) ,colour = 'steelblue', alpha = .2) +
geom_line(aes(y = q.normal, x=ppoints(MSFT))) +
labs(x = 'Probability Points', y= 'Sample Quantiles')
The black line is the ideal distribution. For positive values the blue points follow the black line closely, but for negative values they deviate more. Also, in the negative tail (-0.050, -0.025) the blue points sit above the black line, which translates into extreme negative returns being more likely than under the normal.
qqplots
Comparing points with their ideal distribution effectively shows where the numbers deviate and how many of them do so. Quantile-quantile plots are built on the same idea: instead of plotting both quantile curves against probability points and overlapping them, we discard the probability points and place the quantiles of the two series on the x and y axes. For a qqplot you can also use the standard normal; the only difference will be the x-axis tick labels:
series <- tail(returns$MSFT,300)
probs <- ppoints(series)
q.normal <- qnorm(probs)
ggplot() +
geom_point(aes(y = sort(series), x= q.normal) ,colour = 'steelblue', alpha = .2) +
geom_abline(aes(slope=sd(series), intercept = mean(series)))
The good news is that we don't have to write this much code for a qqplot when we only want to check normality. We can use qqnorm and qqline as below:
qqnorm(tail(returns$MSFT, 300), col=adjustcolor('steelblue', .3), pch=16)
qqline(tail(returns$MSFT, 300))
The points deviate from the black line on the tails. On the right tail they stay quite close to the line, but on the left tail they deviate considerably.
qqtest Package
Our Statistics faculty is working hard to help you engineers visualize things. Here is a package written by a University of Waterloo professor.
In the previous plot we saw that some points deviate: some by a lot, some by a little; some are just a few, some quite a few. We can add an envelope to check which deviations are statistically tolerable:
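The envelope plot described here can be drawn with the qqtest function (a minimal sketch; qqtest is assumed to use a Gaussian reference distribution by default — see ?qqtest for the exact defaults):

```r
library(qqtest)
# qqplot of the last 300 MSFT returns with a simulation envelope
qqtest(tail(returns$MSFT, 300))
```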
The left-tail points deviate beyond the tolerable range. This means we can reject the hypothesis that the Microsoft daily returns are normal, especially because of their left-tail behaviour.
In finance theory, risk is modelled as a chi-squared random variable. To check this, we can calculate the squared deviations from the mean and test whether they are chi-squared:
series <- tail(returns$MSFT,300)
sse <- (series - mean(series))^2
qqtest(sse, dist='chi-squared', df=1)
It is hard to say it is chi-squared.
series <- tail(returns$MSFT,300)
sse <- (series - mean(series))^2
probs <- ppoints(sse)
q.dist <- qweibull(probs,.5,mean(sse)-.0001)
q.exp <- qexp(probs, 1 / mean(sse)) # exponential is a special case of Weibull
ggplot() +
geom_point(aes(y = sort(sse), x= probs) ,colour = 'steelblue', alpha = .2) +
geom_line(aes(y = q.dist, x= probs), alpha = .5) +
geom_line(aes(y = q.exp, x= probs), colour = 'firebrick')
The data is not exponential either, but by tweaking a little we could fit a Weibull distribution.
Earthquake Data
Number of Earthquakes
Now we can ask whether the number of earthquakes is Poisson and the time between two events is exponential. To this end we will aggregate the earthquakes data, focusing on the destructive ones, i.e. Magnitude >= 6.5:
totalEqs <- subset(eqs, Type == 'Earthquake' & Magnitude >=6.5) %>%
group_by(YearMonth) %>% summarise(nquakes = n())
lambda <- mean(totalEqs$nquakes)
nmax <- max(totalEqs$nquakes)
poisDist <- data.frame(x=0:nmax, p = dpois(0:nmax, lambda))
ggplot(totalEqs) +
geom_histogram(aes(x = nquakes, y=..density..), colour = 'black', alpha=.2, bins=(1+max(totalEqs$nquakes))) +
geom_point(data=poisDist, aes(x=x, y = p)) +
geom_line(data=poisDist, aes(x=x, y = p))
The data is pretty much Poisson, with some differences: there are no months with 0 earthquakes, and there is a pile-up around n=10. These differences might arise because we are using international data, so there is a lot of variety across locations and their geological structures.
## Mean Variance Skewness Kurtosis
## 3.841137 5.584436 1.274698 5.139980
The moments differ from the ideal ones, which for the Poisson are \(\lambda, \lambda, 1/\sqrt\lambda, \lambda^{-1}\) (mean, variance, skewness, excess kurtosis).
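For comparison, the ideal Poisson moments can be computed from the sample rate (a sketch, assuming the totalEqs summary and the fourmoments helper from earlier; note that kurtosis in the moments package reports plain kurtosis, so the Poisson benchmark is 3 + 1/lambda):

```r
lambda <- mean(totalEqs$nquakes)
ideal <- c(Mean = lambda, Variance = lambda,
           Skewness = 1 / sqrt(lambda),
           Kurtosis = 3 + 1 / lambda)  # plain kurtosis = 3 + excess kurtosis
rbind(Sample = fourmoments(totalEqs$nquakes), Ideal = round(ideal, 4))
```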
Days Passed
Now we can calculate the time difference between each two events:
subdata <- subset(eqs, Type == 'Earthquake' & Magnitude >=6.5)
dayPassed <- diff(subdata$Datetime) / (3600*24) # seconds passed divided by number of seconds in a day
dayPassed <- as.numeric(dayPassed) # converted a date-time object to numeric
ggplot() +
geom_histogram(aes(x = dayPassed), colour = 'black', alpha=.2)
library('plotly')
rate <- 1/mean(dayPassed)
probs <- ppoints(dayPassed)
q.exp <- qexp(probs, rate)
q.wbl <- qweibull(probs, .9, 1/rate)
q.gmm <- qgamma(probs, 1.1, rate)
g <- ggplot() +
geom_point(aes(y = sort(dayPassed), x=probs), colour = 'steelblue', alpha = .2) +
geom_line(aes(y = q.exp, x=probs)) +
geom_line(aes(y = q.wbl, x=probs), colour = 'brown') +
geom_line(aes(y = q.gmm, x=probs), colour = 'orange') +
labs(x = 'Probability Points', y= 'Sample Quantiles')
ggplotly(g)
The curves pretty much overlap, except that the probability is a bit higher on the tails.
## Mean Variance Skewness Kurtosis
## 8.265258 88.936180 2.146942 9.893428
A qqplot with envelopes also shows strong evidence that the distribution is exponential. Besides, the skewness and kurtosis are 2.15 and 9.89, close to the ideal values of 2 and 9.
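The envelope check mentioned here can be reproduced with qqtest (a sketch, assuming dayPassed as computed above and that qqtest supports an exponential reference distribution — see ?qqtest for the supported dist values):

```r
library(qqtest)
# Envelope qqplot of inter-arrival times against an exponential reference
qqtest(dayPassed, dist = 'exponential')
```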
Magnitude
What about the magnitudes of each earthquake?
subdata <- subset(eqs, Type == 'Earthquake' & Magnitude >=6.5)
ggplot(subdata) +
geom_histogram(aes(x = Magnitude), colour='black', alpha=.2)
series <- subdata$Magnitude - 6.5
k <- mean(series)
# rate <- 2
probs <- ppoints(series)
q.sample <- sort(series)
q.gmm <- qgamma(probs, 1.1, 1/k)
q.wbl <- qweibull(probs, .95, k)
g <- ggplot() +
geom_point(aes(y = q.sample, x=probs), colour = 'steelblue', alpha = .05,size=2) +
geom_line (aes(y = q.wbl, x=probs)) +
geom_line (aes(y = q.gmm, x=probs), colour='brown') +
labs(x = 'Probability Points', y= 'Sample Quantiles')
ggplotly(g)
Deliverables
1
- Choose a continuous variable (not discrete) and plot the histogram:
- change fill and/or colour to make your plot fancier
- interpret the graph: is it left or right skewed?
- are the tails irregularly thick in certain ranges?
- Also calculate 4 moments and interpret
- Is it left or right skewed?
- Is it platykurtic or leptokurtic?
2
Choose a continuous variable (not discrete), or generate one:
- Plot the sample quantiles against probability points (probs)
- Is it unimodal or multimodal?
- What are the first, second (median) and third quartiles? Show how they can be located on the quantile plot.
- What are the min and max values?
- Are there more points before (on the left of) the median or after (on the right of) the median?
3
If your data is left skewed, find or create another dataset (one that is right-skewed or not skewed). Fit a distribution to the same continuous variable:
- Use qqtest to check whether your data is normal, lognormal, chi-squared or exponential
- Plot the sample quantiles against probability points (probs). Generate ideal quantiles and add them as a line on the quantile plot.